Abstract - This paper investigates a VLSI architecture for robot direct kinematic
computation suitable for industrial robot manipulators. The Denavit-Hartenberg
transformations are reviewed to derive a suitable processing element, namely an
augmented CORDIC. Specifically, two distinct implementations, bit-serial and
parallel, are elaborated. The performance of each scheme is analyzed with respect
to the time to compute one location of the end-effector of a 6-link manipulator,
and the number of transistors required.


1 CORDIC Techniques 


The matrix $A_j$ describing the $j$th link is proposed to be implemented via 4 CORDICs:
two in parallel for the w-axis operation, and another two in parallel for the x-axis
operation. Since the rotation and translation are disjoint from each other, the 4
CORDICs can be arranged as a 2-stage cascade [5].

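Let the $j$th joint orientation vector be denoted by $p_j$, where $p_j = A_j p_{j-1}$.
Consider an intermediate vector $p_j^*$ between $p_j$ and $p_{j-1}$:

$$ p_j = \mathrm{Trans}(w_{j-1}, d_j)\,\mathrm{Rot}(w_{j-1}, \theta_j)\, p_j^* \quad \text{: stage 1} \tag{1} $$

$$ p_j^* = \mathrm{Trans}(x_j, a_j)\,\mathrm{Rot}(x_j, \psi_j)\, p_{j-1} \quad \text{: stage 2} \tag{2} $$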

One set of transformations for each stage, i.e. $\mathrm{Trans}(w, d)\,\mathrm{Rot}(w, \theta)$, is a
block-diagonal matrix and can be implemented orthogonally by two 2x2 matrix
transformations. Note that it can be implemented through one augmented PE, rather than
two different PEs, observing that $\mathrm{Trans}(w, d)$ is a trivial operation. Then,


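$p_j$ (and similarly $p_j^*$) is decomposed into two blocks, e.g. the first two elements
of $p_j$ become one vector $X_j$ (with $X_j^*$ and $w_j^*$ the corresponding blocks of $p_j^*$):

$$ p_j = [X_j,\; w_j,\; 1]^T = [\mathrm{Rot}(\theta_j)\, X_j^*,\; w_j^* + d_j,\; 1]^T \tag{4} $$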


where $w_j$ is the w-axis component of the vector $p_j$, and $X_j$ holds the x- and y-axis
components rotated by $\theta_j$. In a similar way, for $p_j^*$ we can choose a rotated vector
of the y- and w-axis components, kept disjoint through axis shuffling. Finally, $n$
consecutive pairs of rotation and translation can be implemented via a $2n$-stage
cascade. We will call each stage a macro-PE (or an augmented PE); $2n$ of them can be
pipelined to compose an $n$-link computation processor. So as not to differentiate the
two sets of transformations, about the w-axis and the x-axis respectively, we employ
the index $i$ in a unified notation for a macro-PE: for a reference axis $w_i$, there are
a rotation by $\theta_i$ and a translation by $d_i$, with $X_{i-1} = (x_{i-1}, y_{i-1})$ as the
input and $X_i = (x_i, y_i)$ as the output.
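The two-stage decomposition above can be illustrated with a minimal Python sketch. It is not the hardware design: the function names (`macro_pe`, `link_transform`, `link_matrix`, `apply_matrix`) are hypothetical, the rotation is computed with library trigonometry rather than CORDIC, and the usual right-handed sign convention for rotations is assumed. The sketch only checks that one generic macro-PE, used twice with axis shuffling, gives the same result as multiplying by the two 4x4 homogeneous matrices.

```python
import math

def macro_pe(u, v, t, angle, shift):
    """One macro-PE: rotate the pair (u, v) by `angle`, translate t by `shift`."""
    c, s = math.cos(angle), math.sin(angle)
    return c * u - s * v, s * u + c * v, t + shift

def link_transform(p_prev, theta, d, psi, a):
    """Map p_{j-1} = (x, y, w) to p_j using two cascaded macro-PEs."""
    x, y, w = p_prev
    # Stage 2 (x-axis): shuffle axes so that (y, w) is rotated and x is translated.
    y, w, x = macro_pe(y, w, x, psi, a)
    # Stage 1 (w-axis): (x, y) is rotated and w is translated.
    x, y, w = macro_pe(x, y, w, theta, d)
    return x, y, w

def link_matrix(theta, d, psi, a):
    """Reference: Trans(w,d) Rot(w,theta) Trans(x,a) Rot(x,psi) as a 4x4 matrix."""
    ct, st = math.cos(theta), math.sin(theta)
    cp, sp = math.cos(psi), math.sin(psi)
    stage1 = [[ct, -st, 0, 0], [st, ct, 0, 0], [0, 0, 1, d], [0, 0, 0, 1]]
    stage2 = [[1, 0, 0, a], [0, cp, -sp, 0], [0, sp, cp, 0], [0, 0, 0, 1]]
    return [[sum(stage1[i][k] * stage2[k][j] for k in range(4))
             for j in range(4)] for i in range(4)]

def apply_matrix(m, p):
    """Apply a 4x4 homogeneous matrix to the point p = (x, y, w)."""
    v = list(p) + [1.0]
    return tuple(sum(m[i][k] * v[k] for k in range(4)) for i in range(3))
```

Because both stages reuse the same `macro_pe` routine, only the wiring of the inputs changes between the w-axis and x-axis stages, which is the point of the unified index $i$.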

Each macro-PE, comprising one $\mathrm{Trans}(w_i, d_i)$ and one $\mathrm{Rot}(w_i, \theta_i)$, can be
implemented as in Figure 1.a. A one-joint processor is shown in Figure 1.b. Finally, for
a 5-joint system, Figure 1.c shows a fully pipelined structure.

From this point, we concentrate on the implementation of a macro-PE. Observing that
the Rot and Trans functions are disjoint from each other, let us isolate the rotation
part first. The vector rotation for $X_i = (x_i, y_i)$ by the angle $\theta_i$ can be realized
by an iterative algorithm called CORDIC [4], instead of computing trigonometric
functions and applying a matrix multiplication. CORDIC realizes a vector rotation as a
sequence of micro-angle rotations with a pre-fixed sequence of angles. When the
rotation macro-angle is represented as a sum of decomposed micro-angles, i.e.
$\theta_i = \sum_{k=0}^{n-1} \theta_{i,k}$,

$$ X_i = \prod_{k=0}^{n-1} k_k \begin{bmatrix} 1 & \tan\theta_{i,k} \\ -\tan\theta_{i,k} & 1 \end{bmatrix} X_{i-1} \tag{5} $$

where $k_k = \cos\theta_{i,k}$ is a micro-scale factor that contributes to the final scale
factor, explained later. A pre-fixed micro-angle sequence of the specific form
$\tan^{-1} 2^{-k}$ is attractive for VLSI implementation, since each micro-rotation then
consists only of additions, shifts, and an arctangent table lookup. For simplicity of
notation, the subscript $i$ indexing a certain stage will be omitted, and $X$, $Y$ and $Z$
stand for abridged notations of the quantities having subscript $i$.
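As a short worked step (a sketch under the sign convention of the recurrences used in this paper, with $\sigma_k = \pm 1$ introduced here as an assumed name for the direction of the $k$th micro-rotation), each micro-rotation matrix factors into a cosine scale and a shift-and-add matrix once the micro-angle is restricted to $\theta_k = \sigma_k \tan^{-1} 2^{-k}$:

```latex
\begin{bmatrix} \cos\theta_k & \sin\theta_k \\ -\sin\theta_k & \cos\theta_k \end{bmatrix}
= \cos\theta_k
\begin{bmatrix} 1 & \tan\theta_k \\ -\tan\theta_k & 1 \end{bmatrix},
\qquad
\tan\theta_k = \sigma_k 2^{-k}
\;\Rightarrow\;
\cos\theta_k = \frac{1}{\sqrt{1 + 2^{-2k}}} .
```

The matrix on the right needs only a shift by $k$ bits and an addition per component, while the cosine factors do not depend on the sign of $\sigma_k$ and can be collected into a single scale factor.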

Non-redundant: The micro-iterations of the conventional (hereafter called
non-redundant) CORDIC are 3 linear recursive equations: X-recurrence (X-rec.),
Y-recurrence (Y-rec.) and Z-recurrence (Z-rec.) [4]:

$$ \begin{aligned}
X[i+1] &= X[i] + \sigma_i 2^{-i}\, Y[i] \\
Y[i+1] &= Y[i] - \sigma_i 2^{-i}\, X[i] \\
Z[i+1] &= Z[i] - \sigma_i \tan^{-1} 2^{-i}
\end{aligned} \tag{6} $$


With an initial value of $Z[0] = \theta$, CORDIC rotates the initial values $X[0]$ and $Y[0]$
to the final values $X[n]$ and $Y[n]$, while driving $Z[i]$ toward zero, so that $Z[n]$ is
forced to be zero. With $n$ iterations, $n$-bit accuracy of $X$ and $Y$ in the output can be
achieved.

Figure 1: CORDIC-based pipelined architecture for direct kinematics computation:
a. A macro-PE, one stage from an orientation to an intermediate, b. 2-stage cascade,
an $A_j$ transformation module for a link, c. A complete pipelined computation module
for a 6-link system.

For a known angle, the direction of the rotation, $\sigma_i$, can be pre-computed or
calculated one by one on the fly using the following selection function:

$$ \sigma_i = \begin{cases} 1 & \text{if } Z[i] \ge 0 \\ -1 & \text{if } Z[i] < 0 \end{cases} \tag{7} $$


The CORDIC rotation does not preserve the input norm. To get a rotated vector having
the same length as the input $(X[0], Y[0])$, $X[n]$ and $Y[n]$ need to be compensated by a
scaling factor $K$,

$$ K = \frac{\|(X[n], Y[n])\|}{\|(X[0], Y[0])\|} = \prod_{i=0}^{n-1} \sqrt{1 + \sigma_i^2\, 2^{-2i}} \tag{8} $$

where $\|\cdot\|$ stands for the norm of the vector. Note that $K$ is constant for the
non-redundant scheme since $\sigma_i$ is in {-1, 1}.
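The non-redundant recurrences, the sign-based selection function and the scale compensation can be put together in a few lines of Python. This is a floating-point sketch for illustration only, not a bit-accurate model of the hardware; the function name `cordic_rotate` and the default of 32 iterations are assumptions. With the sign convention of the recurrences as written, a positive angle yields $(x\cos\theta + y\sin\theta,\; -x\sin\theta + y\cos\theta)$.

```python
import math

def cordic_rotate(x, y, theta, n=32):
    """Rotate (x, y) by theta with n shift-and-add micro-rotations.

    Returns the scale-compensated vector and the residual angle Z[n].
    Converges for |theta| up to about 1.74 rad (the sum of all micro-angles).
    """
    z = theta
    k = 1.0  # accumulated scale factor K
    for i in range(n):
        sigma = 1 if z >= 0 else -1           # direction from the sign of Z[i]
        shift = 2.0 ** -i                     # a right shift by i bits in hardware
        x, y = x + sigma * shift * y, y - sigma * shift * x
        z -= sigma * math.atan(shift)         # arctangent table lookup in hardware
        k *= math.sqrt(1.0 + shift * shift)   # constant, since sigma is never 0
    return x / k, y / k, z
```

For example, `cordic_rotate(1.0, 0.5, 0.6)` agrees with the trigonometric result to well below single-precision accuracy. Because `k` does not depend on the data, a real design would store it as a constant instead of accumulating it at run time.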

Redundant: Non-redundant CORDIC is inherently slow, with a delay of $O(n^2)$ due to
its recursiveness and serial dependency, since a micro-rotation with delay $O(n)$ must
finish before the next micro-rotation is processed. The delay of a macro-rotation
($n$ micro-rotations) can be improved from $O(n^2)$ to $O(n)$ by using redundant arithmetic
(carry-free addition such as carry-save or signed-digit addition) and determining the
direction of the rotation $\sigma_i$ from an estimate instead of an exact value [9]. Redundant
addition has a delay of $O(1)$ instead of $O(n)$, and estimating the direction is necessary
so as not to erode this $O(1)$ advantage. This requires modification of the recurrences
and of the selection function. The redundant CORDIC scheme produces its output about 4
times faster than the non-redundant one. However, it introduces additional cost, since
allowing $\sigma_i$ to be in {-1, 0, 1} makes the scale factor $K$ variable, depending on the
macro-angle.

Constant-Factor-Redundant: To reduce the implementation cost of redundant CORDIC,
it is desirable to have a constant scale factor by forcing $\sigma_i$ to be in {-1, 1}.
However, since $\sigma_i$ is determined from an estimate, the question of guaranteed
convergence arises. A scheme that appends correcting iteration stages at proper
positions has been proposed [10]. Following this idea, the number of extra correcting
iterations is further reduced by dividing the micro-iterations ($i = 0$ to $i = n - 1$)
into two groups: one group where the direction of the rotation is in {-1, 1}, for
$i = 0$ to $i = n/2$, and the other where it is in {-1, 0, 1}, for $i = (n + 1)/2$ to
$i = n - 1$. This reduces the number of correcting iterations by 50%, since no correcting
iteration is needed for the second half of the micro-iterations, and a constant scale
factor $K$ is still obtained, since the value of $K$ in $n$-bit precision does not depend on
the $\sigma_i$ values for $(n + 1)/2 \le i \le n - 1$. The Z-recurrence can also be modified so
that $\sigma_i$ is determined quickly by looking at a few most significant bits. This new
scheme is called Constant-Factor-Redundant CORDIC (CFR-CORDIC). The modified
recurrences and selection function for the scheme are described below.
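The claim that the second half of the micro-iterations does not affect $K$ within $n$-bit precision can be checked numerically. The sketch below is an illustration in floating point, with hypothetical helper names (`scale_factor`, `tail_effect`); it compares the scale factor when every direction is nonzero against the scale factor when all directions from index $(n + 1)/2$ onward are zero, which is the largest change the second group can cause.

```python
import math

def scale_factor(sigmas):
    """K for a given direction sequence, with sigma_i in {-1, 0, 1}."""
    k = 1.0
    for i, s in enumerate(sigmas):
        k *= math.sqrt(1.0 + (s * 2.0 ** -i) ** 2)
    return k

def tail_effect(n):
    """Relative change of K when the second half of the directions is all zero."""
    half = (n + 1) // 2
    k_full = scale_factor([1] * n)
    k_zero_tail = scale_factor([1] * half + [0] * (n - half))
    return k_full / k_zero_tail - 1.0
```

For $n = 16$ the relative change stays below $2^{-16}$, whereas a zero direction in the first half (for instance at $i = 0$) changes $K$ by a large factor, which is why only the first group must be restricted to {-1, 1}.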
$$ \begin{aligned}
X[i+1] &= X[i] + \sigma_i 2^{-i}\, Y[i] \\
Y[i+1] &= Y[i] - \sigma_i 2^{-i}\, X[i] \\
U[i+1] &= 2\,(U[i] - \sigma_i 2^{i} \tan^{-1} 2^{-i})
\end{aligned} \tag{9} $$

where $U[i]$ is introduced for implementation simplicity and is equal to $2^i Z[i]$, and
the selection function is given as follows:

$$ \sigma_i = \begin{cases}
1 & \text{if } U[i] > 0, \text{ or } U[i] = 0 \text{ and } i < n/2 \\
0 & \text{if } U[i] = 0 \text{ and } i \ge n/2 \\
-1 & \text{if } U[i] < 0
\end{cases} \tag{10} $$
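The relation between the shifted angle variable and the original Z-recurrence can be illustrated as follows. This sketch uses exact floating-point values of $U[i]$ rather than the $t$-bit estimate of a redundant representation, so it shows only the recurrence and the three-way selection, not the estimation or the correcting iterations; the function names are hypothetical.

```python
import math

def select_direction(u, i, n):
    """Three-way selection: zero is allowed only in the second half."""
    if u > 0 or (u == 0 and i < n / 2):
        return 1
    if u == 0:
        return 0
    return -1

def angle_recurrences(theta, n):
    """Run the U-recurrence and the plain Z-recurrence side by side."""
    u = z = theta                 # U[0] = 2^0 * Z[0]
    trace = []
    for i in range(n):
        sigma = select_direction(u, i, n)
        micro = math.atan(2.0 ** -i)
        u = 2.0 * (u - sigma * (2.0 ** i) * micro)   # shifted recurrence
        z = z - sigma * micro                        # unshifted recurrence
        trace.append((u, z))      # holds (U[i+1], Z[i+1])
    return trace
```

Each entry of the returned list satisfies $U[i+1] = 2^{i+1} Z[i+1]$, so the sign can be read near a fixed bit position of $U$ instead of a position that moves right as $Z$ shrinks.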


When $t$ fractional bits are used in the estimate, i.e., the estimate of $U[i]$ is
computed using $t$ fractional bits of the redundant representation of $U[i]$, correcting
iterations need to be included, where the interval between the indexes of successive
correcting iterations should be less than or equal to $(t - 1)$, up to the last iteration
index equal to $n/2$. When a correcting stage is necessary at the $j$th micro-iteration
step,

$$ U^c[j+1] = U[j+1] - 2\,\sigma_j^c\, 2^{j} \tan^{-1} 2^{-j} \tag{11} $$

with the direction of the rotation $\sigma_j^c$ determined from the same selection function
as eq. (10), except that it is decided based on $U[j+1]$ instead of $U[i]$.

So far, we have discussed the recursive structures of several CORDIC schemes that
implement the rotation part of the basic PE, as depicted in Figure 1. The PE, augmented
by a translator, necessitates a scaling operation at each stage, because the shuffling
of the output at each stage makes an accumulated scaling factor complex to process at
the final stage. The scaling operation has been solved either explicitly or implicitly.
The explicit way divides the rotated vector by the scale factor, which is a known
constant for the non-redundant scheme and otherwise has to be calculated while running
the micro-steps of CORDIC [4,9]. The division can be processed by another CORDIC (in
linear mode) or by a divider. The implicit approach reconfigures the sequence of
micro-iterations of the CORDIC so that the result has a different norm from that
obtained without scaling micro-iterations. Scaling micro-iterations in general aim at
bringing the adjusted scaling factor to a power of 2 or to 1, which can easily be
normalized to unit size. Each micro-iteration can be composed of i) reduction
axis-scaling [11], ii) repetition of vector-scaling, iii) expansion axis-scaling, or
combinations thereof [12]. How to search for such scaling sequences, beyond the greedy
method or the decomposed method [13], remains to be studied further. In summary,
explicit scaling almost doubles the system complexity, while implicit scaling increases
it by 25% for the non-redundant scheme and by about 30% for the redundant scheme.


2 Application to Direct Kinematics 

In this section, we design an architecture for the direct kinematics computation based
on CFR-CORDIC. The data path is parallel. To analyze its performance, we define a new
measure, namely the one-position calculation time. Using this measure, we also analyze
the performance of the bit-serial architecture, which can be implemented similarly.

2.1 Performance Measure

Let us define the following parameters.

$b_i$ : the number of bits in each input x, y and w

$b_f$ : the number of bits in each output

$n_l$ : the number of links (= 6)

$f_c$ : the available data shift rate

$\Delta$ : the step time per micro CORDIC iteration

$f_i$ : the input bit rate

Additionally, we define a measure parameter $T_\Delta$,

$$ T_\Delta = \text{step-time}(\Delta) \times \text{number of steps}, $$

to compare the performance of the various schemes. For a discrete element
implementation, $\Delta$ corresponds to one single external clock time $1/f_c$. Note that
$\Delta$ varies depending on the particular implementation of a macro-PE. Without loss of
generality, let us define the unit of $\Delta$ to be 1 for a one-bit full addition time. The
input processing rate can alternatively be interpreted as

$$ \frac{f_i}{b_i} \le \frac{1}{T_\Delta} \tag{12} $$

which limits the maximum rate of input vector sampling that can be processed by an
implemented processor.

2.2 Performance Comparison

Bit Serial: A macro-PE using a serial data path and serial arithmetic units for CORDIC
is shown in Figure 2 [6]. Figure 2.a shows the symmetric components of a bit-serial PE
in the x, y and w representation, and Figure 2.b details each block (X-recurrence or
Y-recurrence) employing bit-serial arithmetic. The W-recurrence is in Figure 2.c, and
the Z-recurrence in Figure 2.d. The x and y components of the input vector $X_{i-1}$ are
taken initially as $X[0]$ and $Y[0]$, and the initial angle $Z[0]$ is set to the
corresponding joint angle. After performing $n$ micro-iterations, CORDIC produces
$n$-bit precision outputs leading to $X_i$.

In the serial scheme without macro-pipelining, denote the basic step time by $\Delta_1$,
which is equivalent to $\Delta$. Using one adder recursively $n_l$ times to process the $n_l$
links gives

$$ T_{\Delta_1} = \Delta_1 \, n_l \, (b_f + b_i (b_i + \log_2 b_i)), $$

where the output has a $b_f$-bit buffer.

CFR-Redundant Parallel: To increase the throughput of the previous scheme, the
bit-serial PEs can be replaced by PEs using parallel arithmetic. When parallel
arithmetic and non-redundant CORDIC are adopted, the corresponding parameter becomes

$$ T_{\Delta_2} = \Delta_2 \, n_l \, (b_i + \log_2 b_i) $$

where $\Delta_2$ equals the time for one micro-rotation (the time for a variable shifter plus
the time for a carry-propagate addition), approximately $2 \log_2 b_i$, assuming a fast
variable shifter and a fast carry-propagate adder. The step time can be further
shortened by adopting CFR-CORDIC,

Figure 2: A bit-serial PE: a. A macro-PE with X-, Y- and W-recurrence, b. Detail of
either block, c. W-recurrence, d. Z-recurrence.











where a carry-free adder (signed-digit adder) replaces the carry-propagate adder.
Figure 3.a shows a macro-PE in its components, and Figure 3.b details each block
(X-recurrence or Y-recurrence) employing parallel/redundant arithmetic. The
Z-recurrence is in Figure 3.c.



Figure 3: A parallel/redundant PE: a. A macro-PE with X- and Y-recurrence, b. Detail
of either block, c. Z-recurrence.

Description       $\Delta_j/\Delta$    $T_\Delta$       Processing rate    TRs estimate
Bit-serial        1          $1200\Delta$    600k               2K
  (parallel)                           4M                 12K
Parallel (CFR)    5          $500\Delta$     2M                 6K
  (parallel)                           10M                40K

Table 1: Time and complexity comparison


In this case, the sign of $Z[i]$ at the $i$th micro-iteration cannot be detected by
looking at the most significant bit, since $Z[i]$ is in a redundant number
representation. To determine the sign of $Z[i]$ quickly by looking at a few significant
bits, CFR-CORDIC uses an estimate of the shifted $Z[i]$ (that is, $U[i]$) using $t$
fractional bits. As discussed earlier, the number of fractional bits used for the
estimate also determines the frequency of correcting iterations: the more fractional
bits are used, the fewer correcting iterations are required. Let the number of
correcting iterations be denoted by $\eta$. The corresponding $T_{\Delta_3}$ becomes

$$ T_{\Delta_3} = \Delta_3 \, n_l \, (b_i + \log_2 b_i + \eta) $$

where $\Delta_3$ equals the time for a carry-free addition plus the maximum of the times for
the selection function and the variable shifter, approximately $(1 + \log_2 b_i)$. Note
that a practical number of correcting iterations is much smaller than $b_i$, e.g. 1 for
16-bit resolution. Hence, we can approximate $T_{\Delta_3}$ by that of the redundant scheme
without a correcting iteration.

For the case $b_i = 12$, $b_f = 16$, the estimated $T_\Delta$ is summarized in Table 1. To get
first-order estimates of the available speed and area, we use the figure that one full
adder (also a one-bit shifter) requires approximately 50 TRs and one 20 nsec clock
cycle [14].
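The step-count formulas of this section can be evaluated directly for the case $b_i = 12$, $b_f = 16$ and 6 links. The sketch below is only an arithmetic check; the function names and the parameter name `n_links` are hypothetical, and the CFR step time is passed in explicitly so that the rounded value 5 listed in Table 1 can be used in place of $1 + \log_2 b_i \approx 4.58$.

```python
import math

def t_bit_serial(b_i, b_f, n_links, delta_1=1.0):
    """Bit-serial scheme: step time times n_links * (b_f + b_i * (b_i + log2 b_i))."""
    return delta_1 * n_links * (b_f + b_i * (b_i + math.log2(b_i)))

def t_parallel_nonredundant(b_i, n_links, delta_2=None):
    """Parallel, non-redundant scheme; step time defaults to 2 * log2(b_i)."""
    if delta_2 is None:
        delta_2 = 2.0 * math.log2(b_i)
    return delta_2 * n_links * (b_i + math.log2(b_i))

def t_parallel_cfr(b_i, n_links, eta=1, delta_3=None):
    """Parallel CFR scheme with eta correcting iterations; default 1 + log2(b_i)."""
    if delta_3 is None:
        delta_3 = 1.0 + math.log2(b_i)
    return delta_3 * n_links * (b_i + math.log2(b_i) + eta)

serial_steps = t_bit_serial(12, 16, 6)            # about 1218 unit steps
cfr_steps = t_parallel_cfr(12, 6, eta=1, delta_3=5.0)  # about 498 unit steps
```

These values are consistent with the rounded entries $1200\Delta$ and $500\Delta$ in Table 1. Multiplying by the assumed 20 nsec unit step gives roughly 24 microseconds and 10 microseconds per end-effector position, respectively.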


3 Conclusion 

We have examined various kinds of CORDIC schemes as a macro-PE module for the
direct kinematics processor, and showed that their micro-level regularity is suitable
for VLSI implementation, illustrated with specific schematics covering the conventional
non-redundant, the redundant and the Constant-Factor-Redundant schemes. The
cost-effectiveness of the selected architectures has been analyzed for bit-serial,
parallel and pipelined structures with respect to the time and the number of modules
required to compute one location of the end-effector of a 6-link manipulator, given a
set of angle measurements. The comparison table shows the CORDIC-based robotics
processor to be a prospective VLSI solution for a wide range of kinematics calculation
requirements, trading off size against speed.